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Abstract — We develop the pruned continuous Haar transform 
and the fast continuous Fourier series, two fast and efficient 
algorithms for rectilinear polygons. Rectilinear polygons are 
used in VLSI processes to describe design and mask layouts 
of integrated circuits. The Fourier representation is at the heart 
of many of these processes and the Haar transform is expected 
to play a major role in techniques envisioned to speed up VLSI 
design. To ensure correct printing of the constantly shrinking 
transistors and simultaneously handle their increasingly large 
number, ever more computationally intensive techniques are 
needed. Therefore, efficient algorithms for the Haar and Fourier 
transforms are vital. 

We derive the complexity of both algorithms and compare it to 
that of discrete transforms traditionally used in VLSI. We find a 
significant reduction in complexity when the number of vertices 
of the polygons is small, as is the case in VLSI layouts. This 
analysis is completed by an implementation and a benchmark 
of the continuous algorithms and their discrete counterpart. We 
show that on tested VLSI layouts the pruned continuous Haar 
transform is 5 to 25 times faster, while the fast continuous Fourier 
series is 1.5 to 3 times faster. 

Index Terms — Continuous transform, Fast Haar and Fourier 
algorithms, 2D rectilinear polygons, VLSI 



I. Introduction 

MOORE'S famous law from 1965 [1] has been a major 
driving force in the effort to shrink transistors in Very 
Large Scale Integration (VLSI). The tremendous progress in 
optical lithography has been the enabler for massive produc- 
tion of integrated circuits and a dramatic decrease in unit cost. 

In optical lithography [2], patterns of the integrated circuits 
are transferred to silicon by shining light through a mask 
and subsequently using a lens to concentrate the light onto a 
photosensitive layer, as depicted in Fig. 1. This is followed by 
an etching step, which transfers the pattern to the silicon. The 
Fourier transform of the mask is extensively used in simulation 
of the lithography process owing to Fourier transforming 
properties of lenses [3]. The projected image is the convolution 
of the mask with an ideal filter with frequency response 
H(f,g) = 1 if y/f 2 + g 2 < ?f and otherwise, where / 
and g are the spatial frequencies, NA the numerical aperture 
of the lens and A the wavelength of the source [4] . The size of 
the smallest feature printable by straightforward means with 
such a lithography system is 0.25X/NA [5]. 
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The smallest feature in integrated circuits, also called tech- 
nology node, has shrunk from a few micrometers in the early 
days of optical lithography to 32nm for the latest commercially 
available technology. The next industry target is to print 22nm 
with sufficient yield. One way to achieve this target is to 
develop extreme ultra-violet light source [6], which, with a 
wavelength of 15nm, would enable reliable production of 
future technology nodes. However, it will seemingly not be 
ready on time for the 22nm node, and remains to be proved a 
viable technology. This means the current state of the art light 
source with a wavelength of 193nm has to be used for the 
22nm node. Along with a numerical aperture of NA = 1.44 
in modern lithography systems, this sets the smallest feature 
possible at 34nm. 

For the 32nm technology node, the system has already been 
pushed past its apparent limit. At 22nm, the system is operat- 
ing far below its limit, resulting in severe optical degradation 
due to printed shapes being much smaller than allowed by the 
current wavelength of the light. As a consequence, ever more 
burden is placed onto computationally intensive techniques to 
circumvent the optical degradation, and thus ensure sufficient 
manufacturing yield. These techniques, collectively known 
as computational lithography, include traditional resolution 
enhancement [4], source-mask optimization [7], and inverse 
lithography [8]. They strive to exploit all degrees of freedom 
in the lithography process, including illumination amplitude, 
direction and phase [5], All of these techniques rely on com- 
putationally intensive simulation of the underlying physical 
processes. An alternative is to replace the simulation altogether 
by a statistical model. For example, Kryszczuk et al. [9] 
avoid costly physical simulation by directly predicting the 
printability of the layouts using a classifier trained with feature 
vectors from orthogonal transforms. 

VLSI layout file sizes are expanding rapidly, with an in- 
creasing number of transistors packed into a single design. 
They can reach more than a terabyte for a single layer in 
technology nodes under 40nm [10]. This coincides unfortu- 
nately with the increasing complexity of the aforementioned 
computational lithography algorithms. Taking these factors 
into account, having highly efficient algorithms at various steps 
of the lithography process becomes crucial. 

VLSI layouts consist primarily of highly repetitive patterns 
of rectilinear polygons, those containing only right angles. 
Their vertex description, although very compact, is cumber- 
some to use for applications where a fixed-length represen- 
tation is required, in particular machine learning. However, 
they are also very sparse in the Haar basis since the building 
blocks of its basis functions are themselves rectangles. The 
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Fig. 1. Basic principle of optical lithography. Patterns of the integrated 
circuit are transferred to the silicon wafer by shining light through a mask 
and subsequently using a lens to concentrate the light onto a photosensitive 
layer. 

Haar transform is thus a potentially excellent candidate to 
provide fixed-length features for any post-processing envi- 
sioned in computational lithography. To extract Haar transform 
coefficients from such enormous amount of vertex descriptions 
of polygons, one needs a fast algorithm. 

The Fourier representation is crucial in computational 
lithography, especially in simulation of the lithography pro- 
cess. The underlying physical processes is continuous, and 
using the continuous Fourier series (CFS) would seem natural. 
However, a discretization step, comprising the sampling of 
the polygons, followed by a fast Fourier transform is instead 
standard practice. Unfortunately, sampling introduces aliasing 
and thus reduces overall accuracy. CFS does not suffer from 
aliasing. However, a fast and efficient algorithm is required to 
make the use of CFS practical given the extreme amount of 
data to process in VLSI layouts. 

Despite its close affinity with rectilinear polygons, the Haar 
transform has seen surprisingly little use in lithography-related 
algorithms. In 1985, Haslam et al. used a discrete Haar 
transform to compress the Fourier precompensation filters 
for electron-beam lithography [11], whereas Ma et al. used 
the Haar transform in inverse lithography to regularize the 
obtained mask [12]. When only a sampled signal is available, a 
continuous transform cannot be computed. Even the so-called 
continuous wavelet transform is computed as the projection 
of a discrete signal onto an overcomplete basis [13]. For 
polygons it is however possible to compute the inner products 
with continuous basis functions. In [14], we introduced the 
pruned continuous Haar transform (PCHT), a fast algorithm 
to compute the continuous Haar transform coefficients. We 
further study this algorithm in more detail in this paper. 

As mentioned, the Fourier transform is extensively used 
at various levels of the lithography process. Uses of the 
fast Fourier transform (FFT) include the computation of a 
precompensation filter to reduce proximity effects [15], and 
approximating the diffraction orders of the mask [7]. In [16], 
a road map for the efficient use of the 2D FFT in computational 
lithography is developed. The mathematical idea to compute 
the CFS of polygons in itself is not new. In 1983, Lee and 
Mittra used geometry to derive a closed-form expression for 
the Fourier integral over a polygon [17]. A few years later, Chu 
and Huang proposed a new and elegant derivation based on 
Stokes' theorem [18]. More recently, both Brandoli et al. [19] 
and Lu et al. [20] extended the solutions to the A^-dimensional 
case using the divergence theorem. 

The inherently continuous nature of polygons makes a 



continuous transform a natural tool for VLSI layouts. To 
the best of our knowledge, algorithms making efficient use 
of this type of representation to quickly compute continuous 
transform coefficients do not exist. We present in this paper 
the first fast and efficient continuous Haar transform and 
Fourier series algorithms applied to rectilinear polygons from 
VLSI layouts. They are based on a closed-form formula that 
we derive for 2D continuous separable transforms using a 
decomposition of rectilinear polygons into rectangles. This 
formulation allows us to derive two fast algorithms to compute 
the transform coefficients: PCHT, first introduced in [14], 
and the fast continuous Fourier series (FCFS) algorithms. 
PCHT has a fast orthogonal wavelet transform structure that 
is pruned using computational geometry techniques. FCFS 
results from reducing the CFS computation problem to a 
few sparse discrete Fourier transforms (DFT) computed using 
pruned FFT algorithms. We evaluate the complexity of both 
algorithms and the performance of their implementation on 
real VLSI layouts relative to their discrete counterparts. We 
find PCHT and FCFS to be up to 25 and 3 times, respectively, 
faster than their discrete counterparts. 

This paper is organized as follows. Section II introduces the 
necessary background in VLSI layouts, the continuous Haar 
transform (CHT), and the CFS. Section III presents a frame- 
work for taking continuous transforms of rectilinear polygons 
as well as the proposed PCHT and FCFS. In Section IV, the 
performance of both algorithms is evaluated and compared to 
that of their discrete counterparts. Finally, Section V concludes 
by discussing the superiority of PCHT and FCFS over their 
discrete equivalents, and sketches possible future directions. 

II. Background 

In this section, we first briefly describe VLSI layouts and 
how they are created. The rectilinear polygons composing 
VLSI layouts are described mathematically, laying the ground- 
work for the algorithms to come. We then describe the 2D 
continuous Haar transform and the fast wavelet transform 
algorithm used to compute it. Finally, a short refresher on 
the CFS is given. 

A. VLSI Layouts 

1) Layouts: VLSI layouts are composed of millions of 
rectangles, or more generally, rectilinear polygons, which 
come from the transformation of elements of a functional 
electric circuit. This data is then used for tasks such as mask 
generation and printing using an optical lithography system. 

Such a layout typically consists of several layers of various 
types. Fig. 2 shows fragments of three different types of layers. 
These are taken from Metal 1 (Ml), which is mostly random 
logic, Metal 2 (M2), containing some logic and wires and 
Contact Array (CA), providing contacts between the different 
layers. In addition, there are several other types of layers, 
omitted here, similar to those of Ml, M2 or CA. 

2) Rectilinear Polygons: We now give a mathematical 
definition of the polygons in VLSI layouts. They are recti- 
linear (only right angles), simple (edges do not intersect, no 
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Fig. 2. From left to right: Examples of 1024nmx 1024nm tiles from Metal 
1, Metal 2 and Contact Array layers, respectively. Note that they exclusively 
contain rectilinear polygons. 




Fig. 3. On the left, a rectilinear simple lattice polygon. The interior of 
the polygon is shaded. The full-lined edges are included, while the dashed 
edges are not. Arrows indicate vertex ordering. On the right, illustration of 
the construction of the left polygon from disjoint rectangles. The minus here 
stands for the set difference operator. 

holes), lattice (vertices are on the integer lattice) polygons. An 
example of such a polygon is shown in Fig. 3. 

A standard layout description consists of polygons defined 
by the set of the ordered coordinates of their K vertices 

{(xo,yo), ■ ■ ■ , (xk-i,Vk-i)} , (xi,yi)eZ 2 . (1) 

This set separates the plane in two: inside and outside of the 
polygon. A sequence order is also needed, which we arbitrarily 
choose to be clockwise. A disjoint partition of the plane is 
achieved when we exclude edges if the interior of the polygon 
is on their left or bottom and include them if the interior is 
on the right or top (see Fig. 3). 

Rectangles are rectilinear polygons with four vertices, and 
as they are simpler to handle, we treat them as a special case. 
A rectangle 1Z can be defined by its lower left and upper right 
vertices 

^£3 = y)\xi<x<x 2 ,yi<y< y 2 }. 

The following result, illustrated in Fig. 3, makes it easy to 
take inner products over rectilinear polygons. 

Proposition 1: Any rectilinear polygon V with K vertices 
as in (1) can be expressed by K/2 disjoint rectangles, each 
with a side lying on the x-axis 

p= u u ( 2 ) 

{i\x i+1 >Xi\ \{i\x i+ i<Xi} 

where the indices are taken modulo K and \ is the set 
difference operator. 

An equivalent result exists for the case where a side is lying 
on the y-axis. 

So far polygons have been defined as subsets of R 2 , and 
are not yet directly transformable as 2D signals on M 2 . This 
is achieved by transforming their indicator function 

fr(x,y) = lv(x,y) 

where V C M 2 is the polygon and ts(x, y) = 1, if (x, y) e S 
and otherwise, is the indicator function of a set S. 
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Fig. 4. A pruned signal flow of the ID Haar FWT The full flow-diagram 
is shown in light gray. The transformed signal is f(t) = u(t — 3), defined 
on [0,8), where u(t) is the Heaviside function. Xj ^ = //, V^fl^ an d 

Cj t k = (f , V^/- Vj 8 fe and i>f1 are the ID Haar basis functions on [0, 8). 
For simplicity, the scaling of the transform coefficients has been omitted. 

B. Continuous Haar Transform 

The 2D Haar basis over T = [0, N x ) x [0, N y ) is 

{ ^0,0,0 , ^klky ' ^jtlky . ^klky } (3) 

where j e N and k Xl k y <G {0, . . . , V — 1}. In practice, j 
is limited to some maximum level of decomposition J. The 
scaling function is 

where ip(t) = lif0<f<l and otherwise. The three other 
basis functions can be defined using a recursive relationship. 
For example 

i>j h k!,k y ( x >y) = ^2^2h n g m <p J+1 , 2k:c +n,2k y + m {x,y) 

n m 

where g n = [2- 1 /2 2 -i/2] and hn = [ 2 -i/2 _ 2 "V2] are 
the Haar filters. By replacing h n g m in the sum by g n gm, 

g n h m and h n h m we obtain ipj,k x ,k y , ^klky and ^j,k],ky 
respectively. The dyadic CHT of a function / is given by its 
inner product with the basis functions. The discrete counterpart 
of the CHT is the discrete Haar transform (DHT). For a more 
thorough introduction to the CHT and DHT, see [21]. The 
CHT and DHT coefficients are identical for 2D rectilinear 
polygonal patterns. Both can be computed using the fast 
orthogonal wavelet transform (FWT) [22]. This algorithm has 
a Cooley-Tukey butterfly structure [23], where only the inner 
products with the scaling function at the lowest level need be 
computed. The full flow diagram for a length-8 ID FWT is 
shown in light gray in Fig. 4. Fig. 5 shows one butterfly of 
the 2D transform. 

C. Continuous Fourier Series 

The 2D Fourier basis over T = [0, A^) x [0, N y ) is 

f(N x N v )- 1 / 2 e>( Wxkx+w >' ly '>\ , (4) 

I y ) (k,i)& 2 

where w x — j^- and w y = . The Fourier basis assumes that 
the function / under transformation is periodic with period N x 
along the x-axis and period N y along the y-axis. The CFS 
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Fig. 5. One butterfly of the traditional 2D Haar FWT where Xj tkxik = 

(f,V3,k.,k v ) ™d C^ ky = (f,4>^ k b l ky y 3 e N, k x ,ky e 

{0, . . . , 2 J — 1}, and a, b g {g, h}. Scaling of the transform coefficients 
is omitted for simplicity. 



coefficients F k j are then given by the inner product between 
/ and the Fourier basis functions. 

The main difference with the DFT is that the functions are 
continuous in the spatial domain, and thus not periodic in the 
frequency domain. This means that for a perfect reconstruction 
of the image, an infinite number of coefficients is needed. 
On the other hand, the use of the DFT requires sampling, 
which in turn introduces aliasing, due to the infinite bandwidth 
of rectilinear polygons. In contrast, the CFS yields the true 
spectrum of the continuous image. 

III. Continuous Transforms of Rectilinear 
Polygons 

In this section, we derive the algorithms to compute the 
continuous Haar transform and Fourier series coefficients of 
rectilinear polygons. Their theoretical computational complex- 
ity is evaluated and compared with that of the equivalent 
discrete transforms. 

The main reason continuous transforms can be faster than 
their discrete counterparts for rectilinear polygons is that the 
continuous inner product of a basis function is an explicit func- 
tion of the vertices of the polygon, and the vertex description 
(1) is very sparse compared to the size of the image. Moreover, 
no memory is needed to form or store a discrete image. On the 
other hand, the sampling of the polygons to create a discrete 
image can itself be considered a projection on a Dirac basis. 
Sampling followed by a discrete transform, as illustrated in 
Fig. 6, effectively is two transforms, whereas a continuous 
transform can completely omit the sampling operation. More- 
over, as discussed in the introduction, sampling introduces 
aliasing since the assumption that the signal is bandlimited 
does not hold for rectilinear polygons. 

A. Inner Product over Rectilinear Polygons 

In practice, layouts are divided into smaller disjoint or 
overlapping rectangular tiles before applying a transform to 
each individual tile. Therefore, we consider continuous trans- 
forms over a rectangular subset T = [0,N X ) x [0,N y ) Cl 2 
that we call a tile. This poses no restrictions as the size 
can be increased sufficiently to cover the entire layout. In 
the case of VLSI layouts, N x , N y are always chosen to be 
positive integers. An orthogonal basis of functions cf) k j over T 
can be written as {4>k,i '■ T — > C}?^°, =0 , and the transform 
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Fig. 6. Flow diagram of all the transforms. From top to bottom: fast Fourier 
transform (FFT), discrete Haar transform (DHT), pruned continuous Haar 
transform (PCHT) and fast continuous Fourier series (FCFS). 



coefficients of a function / e L 2 (T) are simply the inner 
products with the basis functions 



(f,(t>k,t)= // f(x,y)(t>* ki i(x,y)dxdy. 



(5) 



As all polygons in a layout are disjoint, the image Jt of 
a tile containing polygons To, . . . , Vm-i Q T is the sum of 
the indicator functions of the polygons 

M-l 

Mx,y) = ^ l Pm {x,y). (6) 

m=0 

Now that we have a signal model, we provide two results that 
will allow us to derive the inner product of /t with a given 
basis function. 

Proposition 2: If fr{x,y) is given by (6) and <fik t i(x,y) is 
a basis function, their inner product according to (5) is 

(/t , 4>k,i) = Y <t>l,i{x,y)dxdy 



X m ,2i 



where (x m i ,y m i ) is the i th vertex, out of the K m , of the 
m" 1 polygon and i is taken modulo K m . Here it is assumed 
without loss of generality that for all m x m .n ^ £ m ,i. 

Proof: By the linearity of the inner product and since the 
polygons are disjoint, we have 



M-l 



(fr , 4>k,i) = Y (f v ™ ' 



m=0 



M-l 

E 

m=0 



4i,i{x,y)dxdy. 

(7) 
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Then, using Proposition 1, we split the integral over V m into 
K m /2 over rectangles. If x m ^ < x m ^ + i, the integral is 
positive and is added to the final result. If x m s > x m ,i+i, 
the integral is negative and is subtracted from the final result. 
If x m ,i = %m,i+i, the I th term of the sum is zero. ■ 

We now give the closed-form formula for (f-p , 4> k ,i) when 
<f>k,i is separable. When there is more than one polygon in the 
tile, we compute the continuous Haar and Fourier transforms 
of /t by simply applying (7). The following lemma is derived 
using Proposition 2, the separability of (j>k,i and the fact that 
J^fJo^ 1 (^fe( x 2i+i) - Ql(x2i)) = for rectilinear polygons 
since the index i is cyclic and vertical edges cancel. The 
function 51 is a primitive of <j>. 

Lemma 1: Consider a rectilinear polygon V with K ver- 
tices, and a separable basis function, namely (f>kj(x,y) = 
<t>k{x)(t>i{y). Then 



(fv , 4>k,i) 



K/2-1 

£ 

i=0 



nt(y 2i )(n* k (x 2i+1 )-nux 2i )), (8) 



where indices are taken modulo K. 

B. Pruned Continuous Haar Transform (PCHT) 

We first describe the PCHT algorithm for rectilinear poly- 
gons, first introduced in [14]. We then advance this previous 
work with the derivation of the complexity of PCHT. 

1) Algorithm Derivation: Consider the FWT as described 
in Section II-B and whose full ID flow diagram is shown in 
light gray in Fig. 4. First, note that the linear complexity of 
the FWT and (7) mean that we can use the FWT on individual 
polygons and sum up the transform coefficients to obtain 
the transform of a tile, as illustrated in Fig. 6. Thus, from 
now on, we consider only the transform of a single polygon. 
Second, we use computational geometry techniques to com- 
pute the inner product. The continuous inner product between 
the indicator of a polygon and the support of the scaling 
function, as given by Lemma 1, is the area of the geometrical 
intersection of the polygon and the scaling function multiplied 
by 2 3 / ' y/ 'N x N y . A method to compute the intersection area for 
rectilinear polygons is described in Algorithm 1 . 

The Haar transform acts as a discontinuity detector, and all 
transform coefficients will be zero except when basis functions 
intersect the boundary of the polygon. Therefore, the basis 
functions completely inside or outside the polygon can be 
ignored. In addition, all coefficients positioned below such 
a basis function in the transform tree are also zero. Fig. 7 
shows, in gray, the support of the basis functions of a given 
scale that yield non-zero inner products. As with the original 
FWT, PCHT can be written as a divide-and-conquer algorithm: 
Divide the tile into four rectangular parts recursively until the 
part under consideration is completely inside or outside the 
polygon. Pseudocode for PCHT can be found in [14]. An 
example of the pruned transform flow-diagram for the ID case 
is shown in Fig. 4 in black. 

2) Complexity: We will now estimate the complexity of 
PCHT. As it is highly dependent on the geometry of the 
polygon, we make the worst-case assumption and describe the 
complexity as a function of a few general properties of the 
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Fig. 7. The Haar basis functions yield zero inner product except when 
they intersect the edge of a polygon. The big thick square is the tile with 
N x = N y = N. The dashed squares are the supports of the basis functions 
at a given scale. The thick polygonal border marks the contour of the polygon. 
Basis functions yielding non-zero inner product are shown in gray. 



polygon, such as its number of vertices K and its perimeter 
P. The complexity is also a function of the computational 
complexity A ia (K) of the intersection area algorithm. The 
exact value depends on the assumptions made regarding the 
type of polygons, but is 0{K). In particular, for rectilinear 
polygons, we have 



11 

6K 



if K = 4 
1 if K > 4 



(9) 



additions, multiplications and comparisons. 

We begin by estimating the number of non-terminating 
recursive calls. A call is non-terminating if the support of the 
waveform intersects the boundary of the polygon. At a given 
scale j, there are roughly \2^P/NA such basis functions. We 
assume here that N x — N y — N. Each of these calls makes 
four calls to itself and thus to the intersection area routine 
and also uses three comparisons. The intersection area routine 
uses in turn eleven additions and three multiplications per non- 
terminating call. Finally, a total of M— 1 additions are needed 
to sum up the scaling coefficients of the polygons, and a single 
multiplication is needed for the final scaling factor. We thus 
formulate our estimate of the complexity of PCHT as 



kpCHT 




VPrr 

N 



(A ia (K m ) + 26) +M 



(10) 

operations (additions, multiplications and comparisons), where 
P m and K m are the perimeter and number of vertices of the 
m* polygon, respectively, M is the number of polygons in a 
given tile and J the maximum level of decomposition. For 
a DHT with N x = N y = N = 2 J , the total number of 
operations is A DHT = |(A^ 2 — 1). The relation between the 
complexity and the actual runtime is discussed in Section IV. 



C. Fast Continuous Fourier Series (FCFS) 

We now develop the FCFS algorithm for computing the 
CFS coefficients of rectilinear polygons. The computational 
complexity of FCFS will also be evaluated and compared with 
that of the FFT 

1) Algorithm Derivation: We begin by a closed-form for- 
mula for the CFS coefficients of rectilinear polygons. Applying 
Lemma 1 to the Fourier basis (4) results in the following 
proposition. 
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TABLE I 

Complexity of the steps of the FCFS algorithm. K = J2m=o Km IS THE SUM 0F THE vertices of the M polygons in the N x x N y tile. 



The parameters P x r \ P x !I \ dividers of N x , and P y r \ P y J \ dividers of N y , are chosen to minimize the overall complexity. 



Step Number of Operations Operation performed 



1 




- M 




Sum of polygon areas 


2 


(§- 




xK+^Q^+l) xC FFT10 (pi /) ) t 


ID -^-input pruned FFT 


3 


(1- 




xK+ (±Qi I) +1) xCfftidIpW 


ID -y-input pruned FFT 


4 


(- 


— + -g-C 


A II) + lQi II) Qi II) ) xK + Qi^aQ^+l) xC FFT2D {P^\p^ 


2D K -input pruned FFT 



Total ( - 2 + §Q« + f Q« + f q("> + f Q<"><J<">) X - M 

+ {^Q ( P +l)xC FFT1D {P { / ) ) + ^Q [ y I) + 1) xC FFT1D (P^) 

+ Qi" ) (5Qi" ) + 1) x C7 FFT2D (Pi /I) ,P^ I) ) Operations 

t ID Split-Radix: C FFT1D (JV) = 4N log 2 N - 6JV + 8 *2D Splil-Radix: C FFT2D (N 1C , N y ) = 4N x N y log 2 N x N y - 12JV X JV B + 8N X + 8N y 



TABLE II 

Summary of the computational complexity of the continuous and discrete transforms of a single rectilinear polygon in terms 

of the total number of additions and multiplications. 



Complexity of the Transforms: 

PCHT « ( J] 1 J £ [^1 (A io (*T;) + 26) ) + M t DHT S^pt 
V «=o j=o 1 1 / 

Gcertzel TD-FFT 

FCFS ((AT a: + i)(JV s , + |) + i|)A--M See Table I FFT ^ (4N x N y log 2 N x N y — 12N x N y + 8AT X + 8ATJ,) t 

t A^ a (K"^ ) given in (9) $ Row-column real splil-radix FFT complexily [241 



Algorithm 1 IntersectionArea('P, Tj^ x u) 

Require: A rectilinear polygon V. The support Tj t k x ,k y of 

the basis function <Pj,k x ,k y - 
Ensure: 7 is the intersection area of "P and Tj^ kx ,k y - 



I 4-0 

oi <- k x N x /2\ h <- fc^/ 2 * 

a 2 <- (fc x + l)JV x /2*, 6 2 «- (fcj, + l)JV„/2* 

for every horizontal edge (xi,yi) — > of "P do 

u <— min(xi, u <— max(xi,Xi + i) 

if a 2 < u or w < ai or j/i < 6i then 
continue 

end if 

s «- sign(x i+1 - xi) 

I 4— I + s(min(t;, a 2 ) — max(u, ai))(min(?/j, 6 2 ) — &i) 
end for 



Proposition 3: Given a polygon "P with if vertices, its CFS 
Fk,U k, I e Z 2 , is given by 



if/2-1 



A,o = "o,o E 2/2i(a;2i+i - x 2 i), 



(11) 



i=0 



ff/2-1 



if/2-1 



Fo,l = ®0J ^2 {X2i+1 - X2i)t 



-jw y ly 2i 



if/2-1 
i=0 



]w y ly 2 , 



^ e -jw x kx 2 i+i _ e ~jw x kx 2 i^ 



(14) 



where the scaling factor a k ,i is defined as follows 

if k = I = 
if fc ^ and / = 
if k = and I ^ 
if fc ^ and Z ^ 



J 

27rfe • 



J 
2tt/ 



ffc,o=«*,o E tei-i-lteOe-* -** 2 ', (12) 



(13) 



i=0 



It turns out that, except for Fo,o, all coefficients can be 
computed using DFTs of very sparse signals as we restrict the 
vertices to lie on the integer lattice. We can decompose the 
algorithm into four steps, with each step computing one of 
(11) to (14). For simplicity, we consider only the transform of 
a single polygon here. 

Step 1: Directly compute (11). It is the scaled area of V. 

Fo,o — a o,a Area('P) . This corresponds to line 11 and 18 

in the Algorithm 2. 
Step 2: Compute (12) as a ID DFT: F M = a kfi DFT fe ' j/^j, 

where k! = k mod N x and 

fx[n] = E (y*- 1 ~ n = 0,...,N x -l, 

iex n 

where X n — {i \ Xi = n mod N x }. The values (yi-i — yt) are 
the lengths of the vertical edges. This corresponds to line 5 
and 16 in the Algorithm 2. 
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Step 3: Compute (13) as a ID DFT: F ,i = a ,i DFTV 
where V = 1 mod N y and 

fy[n]= ^2(xi+i-Xi), n = 0,...,N y -l, 

where y n = {i | j/j = n mod A^}. The values (ajj+i — x{) are 
the lengths of the horizontal edges. This corresponds to line 
8 and 17 in the Algorithm 2. 
Step 4: Compute (14) as a 2D DFT: F k j = 

atk,i DFTk'M j/xK j. where fc' = k mod A^, i' = I mod A^ 
and 

fsT/2-1 

f x , y [m,n]= ^2 ( 1 z m:n (x 2 t+i,y2i) - lz m . n (x2i,V2i)) 

i=0 

where Z m . n = {(x,y) \ (x,y) E Z 2 , x = m mod N x , y = 
n mod N y }. This is a sparse N x x N y image with l's and 
— l's placed at the vertices. This corresponds to line 8, 9 and 
15 in the Algorithm 2. 

For multiple polygons, we can use (7) for Step 1, and the 
linearity of the DFT is used to include all the polygons in 
f x , fy and f xy for the three other steps. The four steps are 
illustrated in Fig. 6. Algorithm 2 gives the pseudocode of 
FCFS. In practice, it is often possible to combine the creation 
of fx, fy and fxy with the computation of the DFTs, in which 
case no full-length arrays need to be stored. 

Having reduced the problem of computing the CFS to a 
few real DFTs of sparse signals, we can exploit the extensive 
collection of available DFT algorithms. The FFT algorithm 
[25] is not the most efficient in our case, as we could also 
use a single FFT on a sampled version of fx, albeit with 
some loss of precision. If we are only interested in a few CFS 
coefficients, we can apply a Goertzel-like algorithm [26] to the 
direct computation of the CFS coefficients using Proposition 3. 
The pruned FFT is the most appealing approach for the 
computation of a large number of CFS coefficients, and in 
particular the transform decomposition FFT (TD-FFT) as it 
is the fastest of all existing pruned FFT algorithms [27], [28]. 
The complexity analysis considers both TD-FFT and Goertzel, 
while the implementation is focused solely on TD-FFT. 

2) Complexity: As in the Haar case, the computational 
complexity depends on polygon geometry. In this case, only 
K, the sum of the number of vertices of the M polygons 
present in the tile, is important. 

We compare in terms of complexity FCFS with the split- 
radix FFT [24] performed on the discrete image of a polygon. 
We choose the split-radix FFT algorithm for it is one of the 
fastest and most widely used FFT algorithms. For FCFS, we 
consider both Goertzel and TD-FFT algorithms to compute 
the sparse DFTs. 

For the complexity analysis, consider the transform of a 
N x x N y tile, where N x and N y are composite numbers. 
Table I details the complexity of each step of the FCFS 
algorithm described in Section III-C1. As observed in [28], 
we need to choose the TD-FFT sub-FFTs lengths P^K Py T \ 
Px^ and Py 11 ^ such that the overall complexity is minimized, 
with the constraint that they are dividers of N x and N y , 



Algorithm 2 FCFS(7', N x , N y ,F) 

Require: "P contains the lists of the vertices of the M recti- 
linear polygons where {x m ^,y m ^) is the i th , out of K m , 
vertex of the m th polygon V m . For all m : x m fi ^ x m ^. 
The tile size is A^ x N y . TD-FFT- ID and -2D take as 
inputs the DFT length and a sparse array whose non-zero 
entries are stored in V, and their locations in S. 
Ensure: F is an N x N matrix containing the N x N first 
Fourier series coefficients of the polygon V . 

1: n <- 0; A <- 

2: for to = to M - 1 do 

3: for i = to K m — 1 do 

4: if x m .i = x m . i+1 then 

5: V K [n] <- y m>i - y m ,i+i; S K [n] <- x m ^ 

6: n n + 1 

7: else if y m .i = y m ,i+i then 

8: V L [n] <- x m ^ +1 - x m .i; S L [n] <- y mA 

9: V KL [2n] <r- 1; S KL [2n] <- {x m . t+l ,y m>i ) 

10: V KL [2n + l] 4r- -1; S KL [2n + l] <- {x m ,i,y m ,i) 

Hi A "s— A + y m ,i(Xm,i+l ~ X m .i) 

12: end if 
13: end for 
14: end for 

15: F[k, 1} «- a kil TD-FFT-2D (V KL , S KL , N x , N v ) 
16: F[k, 0] <r- a k . TD-FFT-1D (V K , S K ', A^^) 
17: F[0, 1} <- a j TD-FFT-1D (V L , S* L , N y ) 
18: F[0,0] <- ao,nA 



respectively. There is however no closed-form solution to this 
problem. In addition, this optimization requires knowledge of 
the specific FFT algorithm used for the sub-FFTs in TD-FFT, 
which is impossible with modern libraries such as FFTW. 
Therefore, the lengths of the sub-FFTs need to be chosen such 
that the runtime on a given architecture is minimized. This can 
be done offline, and the optimal lengths stored in a look-up 
table. 

For the sake of analysis, we assume that the split-radix FFT 
algorithm is used for all FFTs. The complexity of the 2D 
N x x N y complex split-radix FFT is 

C F ft2D = AN x Ny log 2 N X N V - Y2N x N y + SN X + 8N y 

real operations. Real-valued data implies the need for roughly 
half this number of operations. The Goertzel algorithm re- 
quires approximately 0(KN x N y ) real operations for an N x x 
N y -point real DFT with K non-zero inputs. Our algorithm 
requires the computation of the area of the polygons (step 1), 
two length- N DFTs with K/2 inputs each (steps 2 and 3) 
and one 2D length-A^ x N y DFT with K inputs (step 4). 
The exact complexity of FCFS is given in Table I. Table II 
provides a summary of the computational complexity of all 
the transforms. 

IV. Performance Evaluation of PCHT and FCFS 

In this section, we first describe how PCHT and FCFS are 
implemented, and then present results benchmarked on real 
VLSI layouts, comparing the performance to that of traditional 
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Fig. 8. Median runtime (left ?/-axis, data points) and complexity (right y-axis, lines) of the PCHT (x, dashed line) and DHT (•, solid line) of 1024nmx 1024nm 
tiles from the Ml layer containing K vertices. Note the accuracy of our complexity estimate in predicting the qualitative behavior of the PCHT (superposition 
of crosses and dashed line). Even the outliers around K = 190 are predicted. Tiles with K > 200 have a lower complexity as they contain only rectangles, 
which are less complex. However, the complexity is underestimated as it does not take memory transfers into account. The empirical distribution of the 
number of vertices is shown in gray. 



Speed-up 

30xi 1 




1x1- i i i i - 

128x128 256x256 512x512 1024x1024 2048x2048 4096x4096 

Tile Size 

Fig. 9. Speed-up of the average runtime provided by the PCHT on layers 
Ml (x), M2 (♦), CA (•). The higher the better. 1 times speed-up means 
same performance as the DHT. We see that PCHT is at least 5 times faster 
than the DHT. Confidence intervals have been omitted in this figure as in the 
worst case they were found to be within ±0.06 of the value given, with 95% 
confidence. 

discrete transform algorithms. This performance evaluation has 
two goals. The first is to validate the theoretical computational 
complexity and analyze the behavior of the runtime as a 
function of the number of vertices K in a tile. The second 
goal is to measure the improvement in runtime provided by 
the PCHT and FCFS over the DHT and FFT, respectively. 

A. Implementation and Setup of the Benchmark 

All algorithms were implemented in a computational lithog- 
raphy tool, and run on a 3GHz Intel© Xeon 5450 running 
Linux© in 64-bit mode. All code is C++, single-threaded and 
was compiled using GCC 4.1.2 with option "-03". The tool 
takes a layout file as input, parses it, chops polygons and 
places them in their corresponding tiles. Each tile is then 
transformed individually. Fig. 6 shows flow diagrams of the 
different steps involved in the process of transforming a layout 
using PCHT, DHT, FCFS and FFT. The transform is initially 
performed twice to get the machine into steady state, and 
then repeated 10 more times to average out the timing noise. 
Both PCHT and DHT, as described in Section III-B1, were 
fully custom-implemented. The FFT was performed using the 
FFTW3 library [29]. FCFS was custom-implemented using the 
TD-FFT algorithm for pruned FFTs, which in turn use FFTW3 
for the sub-FFTs. For the discrete transforms, an image of the 
tile is first created and then fed to the transform algorithm. 
The time needed to create the discrete image is added to the 
runtime of the discrete transform. 

For the evaluation, the algorithms were run on three lay- 
ers from a 22nm layout of modest size (0.43mmx 0.33mm) 



containing rectangles and more complex rectilinear polygons. 
These layers are Ml and M2, which contain both rectangles 
and other polygons, and CA, which contains only rectangles, 
as shown in Fig. 2. We ran the experiment on squared tiles, 
where side lengths were powers of two from 128nmxl28nm 
to 4096nmx4096nm. 

This experiment has two distinct goals. The first is to 
validate the theoretical complexity derived in Section III as 
a predictor of the behavior of the runtime of the trans- 
forms. To that effect, we use the runtime from Ml tiled in 
1024nmx 1024nm. We compute the average of the 10 runs 
for each tile and for all transforms. We then take the median 
of this average over all tiles containing a given number of 
vertices K. This is shown in Fig. 8 and Fig. 10. We plot K 
on the x-axis, the median runtime on the left y-axis, and the 
complexity on the right y-axis. The plot also shows in light 
gray the empirical distribution of K to indicate which range 
of K is the most important one. 

The second is to measure relative difference of runtime 
between PCHT and FCFS, and their respective discrete coun- 
terparts. To that effect, we aggregate the results to get the 
average of 10 runtimes over a full layer, for all layers and all 
transforms. Our metric is the speed-up, computed by dividing 
the average runtime of the discrete transform by the average 
runtime of the corresponding continuous transform for a given 
layer and tile size. This is shown in Fig. 9 and Fig. 11. We 
plot the speed-up of the average runtime for all layers and tile 
sizes considered. 

B. Benchmark Results 

Fig. 8 shows the runtime and the complexity of the PCHT 
and the DHT as a function of K for 1024nmx 1024nm tiles 
from Ml. The left and right y-axis show the runtime and 
complexity, respectively. We observe that the complexity given 
by (10) describes the qualitative behavior of the runtime of the 
PCHT very well, even for the outliers around K = 190. When 
K > 200 and the tiles are composed exclusively of rectangles, 
(10) underestimates the complexity. In this case, the runtime 
is dominated by memory transfers. 

The gap between the runtime of the DHT and its complexity 
is also explained by the domination of memory transfers, 
which are not accounted for in the computational complexity 
in (10). The dependence of the DHT on K stems from discrete 
image creation and from the fact that our implementation uses 
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Fig. 10. Median runtime (left y-axis, data points) and complexity (right j/-axis, lines) of the FCFS (□, dashed line) and FFT (o, solid line) of 1024nmx 1024nm 
tiles of the Ml layer containing K vertices. We observe that the gap between the runtimes of the FCFS and the FFT is larger than that of their respective 
complexities. The reason being that the pruned FFT can be implemented more efficiently than the FFT as shown in [30]. 



Speed-up 

3.5X| 1 — 




1xh i | | | ,— - 

128x128 256x256 512x512 1024x1024 2048x2048 4096x4096 

Tile Size 

Fig. 1 1 . Speed-up of the average runtime provided by the FCFS on layers 
Ml (x), M2 (♦), CA (•). The higher the better. 1 times speed-up means 
same performance as the FFT. We see that PCHT is at least 1.4 times faster 
than the FFT. Confidence intervals have been omitted in this figure as in the 
worst case they were found to be within ±0.012 of the value given, with 
95% confidence. 

an if statement to avoid storing zero transform coefficients 
whose number decreases with increasing K. Given the very 
high number of zero coefficients, the cost of the ;/ statement 
is justified by the large memory transfer savings. Overall, this 
figure shows that for small values of K encountered in VLSI 
layouts, the PCHT is significantly faster than the DHT. 

Fig. 9 shows that the average runtime of the PCHT for a 
full Ml layer is about 5 times shorter than that of the DHT, 
for all considered tile sizes. For the CA layer, which contains 
only rectangles that have a lower complexity, the speed-up is 
25 fold for large tiles. 

Fig. 10 shows the runtime and the complexity of FCFS and 
the FFT as a function of K for 1024nmx 1024nm tiles from 
Ml. The left and right y-axis show the runtime and complexity, 
respectively. The runtime of FCFS compared with that of the 
FFT is lower than expected given the theoretical complexity. 
Indeed, as shown by Franc hetti and Piischel [30], the pruned 
FFT can be implemented more efficiently than the FFT. Here 
again, the FCFS is significantly faster than the FFT for small 
values of K found in VLSI layouts. 

The speed-up achieved by FCFS over the FFT, shown in 
Fig. 11, is similar for all considered layers. The M2 layer 
shows a slightly higher speed-up because its vertex density 
is lower than that of other layers. For all layers, the highest 
speed-up found is for 1024nmx 1024nm tiles, for which the 
FCFS is about 3 times faster than the FFT. In contrast, the 
results for 128nmxl28nm and 256nmx256nm show only a 
modest speed-up of about 1.5. 

For all speed-up results, confidence intervals were computed 
at the 95% level but found to be negligible and have thus been 



TABLE III 

Average runtime of the transforms of an Ml layer divided 
into 1024nmx 1024nm tiles. There are 127544 non-empty tiles 
for a total of 995497 rectangles and 428817 other polygons. 
Runtime per tile is in /^-seconds. Total runtime is in seconds. 



Average runtime (with 95% confidence intervals) 

PCHT DHT FCFS FFT 

Per Tile [ M s] 919±0.6 4375±12.8 15648±10 47341±8 

Total [s] 117.2±0.2 558±4.5 1996±4 6038±3 



omitted in the figures. However, they can be found in Table III, 
which contains average runtime values of all considered trans- 
forms, for the Ml layer divided into 1024nmx 1024nm tiles, 
both per tile and for the whole layer. For the 127544 tiles of 
Ml, the runtime is found to be on average about 5 times lower 
for the PCHT and 3 times lower for the FCFS, respectively, 
than for their discrete counterparts. Although these numbers 
might at first glance seem modest, the runtime is reduced from 
about 9min for the DHT to only 2 using PCHT, whereas the 
FCFS runs in 30min instead of lh40min for the FFT. These 
are significant time savings. 

V. Conclusions and Future Work 

We developed PCHT and FCFS, two new fast algorithms 
for the computation of the continuous Haar transform and 
continuous Fourier series, respectively, of patterns of rec- 
tilinear polygons, as are typically found in VLSI layouts. 
We showed that the sparsity of the polygon description can 
indeed be exploited by continuous transforms, resulting in 
significant speed-up of the transform coefficients computation. 
The complexities of the two algorithms were analyzed and 
compared with that of their discrete counterparts. They were 
found to be lower in complexity when the number of vertices 
of the polygons is low, as is the case in VLSI layouts. We 
validated this analysis by implementing all algorithms in a 
computational lithography software and running a performance 
evaluation on VLSI layouts. The results not only confirmed 
the validity of our complexity analysis, but also showed 
that continuous transforms can actually be implemented more 
efficiently than their discrete counterparts. This is achieved 
thanks to a more compact description of the input signal, 
allowing a better usage of the available cache and reducing 
memory transfers. We measured the gain in performance using 
a speed-up metric. PCHT was found to run at least 5 times 
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faster than the DHT for all considered layers and tile sizes. A 
maximum speed-up of 30 times was achieved in the case of 
the CA layer divided in 4096nmx4096nm tiles. FCFS showed 
least improvement for small tiles size where it is only 1 .5 times 
faster than the FFT. However, it showed a peak performance 
for 1024nmx 1024nm tiles where it is over 3 times faster than 
the FFT. These speed-ups result in significant time savings 
due to the magnitude of the problem at hand. We therefore 
conclude that the PCHT and FCFS are superior to the DHT 
and FFT, respectively, for rectilinear polygons in VLSI layouts. 

In lithography, often only low-pass Fourier coefficients are 
needed, and it would be of high interest to adapt FCFS to 
yield only this low-pass spectrum. For the sake of speed, 
it is currently common practice to employ decimation using 
a crude low-pass filter, such as a Haar scaling function, 
followed by an FFT, accepting the inevitable large imprecision. 
Straightforward adaptation of FCFS using input and output 
pruned FFT, or even Goertzel, to yield only the low-pass 
unaliased spectrum does not compete in terms of speed with 
the aforementioned decimation scheme. More sophisticated 
spectral methods need thus to be investigated to find a suitable 
trade-off between speed and accuracy. 

We designed PCHT and FCFS with highly suitable struc- 
tures for parallelization. Given the trend towards multi- and 
manycore architectures, such a parallel implementation would 
be a natural next step. Given the close relationship between 
the Fourier and cosine transforms, straightforward extension of 
the FCFS algorithm to the continuous cosine series is possible. 
A further research avenue would be to apply continuous 
transforms to other application domains. For example, relaxing 
the signal model to non-rectilinear polygons or even radically 
different sparse parametric signals. 
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